What is the impact of time on life expectancy?
What is the impact of schooling on life expectancy?
How do infant and adult mortality rates affect life expectancy?
Do densely populated countries tend to have lower life expectancy?
What is the impact of GDP and income on life expectancy?
Does life expectancy have a positive or negative relationship with drinking alcohol?
Does life expectancy have a positive or negative correlation with eating habits, lifestyle, and alcohol?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
import statsmodels.formula.api as smf
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
dataset = pd.read_csv('Life Expectancy Data.csv')
dataset.shape
(2938, 22)
# viewing columns and their data types
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   Country                          2938 non-null   object
 1   Year                             2938 non-null   int64
 2   Status                           2938 non-null   object
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64
 10  BMI                              2904 non-null   float64
 11  under-five deaths                2938 non-null   int64
 12  Polio                            2919 non-null   float64
 13  Total expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15  HIV/AIDS                         2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18  thinness 1-19 years              2904 non-null   float64
 19  thinness 5-9 years               2904 non-null   float64
 20  Income composition of resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB
# removing the spaces to the right and left of the column names
dataset.columns = dataset.columns.str.strip()
# replacing remaining spaces in column names with underscores
dataset.columns = dataset.columns.str.replace(' ', '_')
# replacing dashes in column names with underscores
dataset.columns = dataset.columns.str.replace('-', '_')
# changing column headers to lowercase
dataset.columns = dataset.columns.str.lower()
# Checking the new format of the column names
dataset.columns.to_list()
['country', 'year', 'status', 'life_expectancy', 'adult_mortality', 'infant_deaths', 'alcohol', 'percentage_expenditure', 'hepatitis_b', 'measles', 'bmi', 'under_five_deaths', 'polio', 'total_expenditure', 'diphtheria', 'hiv/aids', 'gdp', 'population', 'thinness__1_19_years', 'thinness_5_9_years', 'income_composition_of_resources', 'schooling']
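The double underscore in `thinness__1_19_years` and the slash left in `hiv/aids` show that the chained replacements above leave a few rough edges. As a hedged alternative (not the approach used in this notebook), a single regular expression can collapse every run of non-alphanumeric characters into one underscore. The helper name `clean_column_names` and the sample headers are assumptions made for illustration.

```python
import pandas as pd

def clean_column_names(columns: pd.Index) -> pd.Index:
    """Lowercase names and collapse runs of non-alphanumerics into one underscore."""
    return (
        columns.str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")
    )

# Hypothetical raw headers resembling the ones in this dataset
raw = pd.Index(["Life expectancy ", " thinness  1-19 years", " HIV/AIDS", "under-five deaths "])
cleaned = clean_column_names(raw).tolist()
print(cleaned)
```

Adopting this variant would rename `thinness__1_19_years` and `hiv/aids`, so any later cell that refers to those columns would need the new names.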
# checking for null value counts
dataset.isna().sum()
country                              0
year                                 0
status                               0
life_expectancy                     10
adult_mortality                     10
infant_deaths                        0
alcohol                            194
percentage_expenditure               0
hepatitis_b                        553
measles                              0
bmi                                 34
under_five_deaths                    0
polio                               19
total_expenditure                  226
diphtheria                          19
hiv/aids                             0
gdp                                448
population                         652
thinness__1_19_years                34
thinness_5_9_years                  34
income_composition_of_resources    167
schooling                          163
dtype: int64
# checking for null value percentages
dataset.isnull().sum()/len(dataset)*100
country                             0.000000
year                                0.000000
status                              0.000000
life_expectancy                     0.340368
adult_mortality                     0.340368
infant_deaths                       0.000000
alcohol                             6.603131
percentage_expenditure              0.000000
hepatitis_b                        18.822328
measles                             0.000000
bmi                                 1.157250
under_five_deaths                   0.000000
polio                               0.646698
total_expenditure                   7.692308
diphtheria                          0.646698
hiv/aids                            0.000000
gdp                                15.248468
population                         22.191967
thinness__1_19_years                1.157250
thinness_5_9_years                  1.157250
income_composition_of_resources     5.684139
schooling                           5.547992
dtype: float64
# visualizing the NA values
import missingno as msno
msno.bar(dataset)
<AxesSubplot:>
# listing the 10 countries with the fewest rows in the dataset
dx = dataset.value_counts('country').sort_values().reset_index().head(20)
dx = dx.country[:10].tolist()
dx
['Dominica', 'San Marino', 'Cook Islands', 'Marshall Islands', 'Tuvalu', 'Saint Kitts and Nevis', 'Palau', 'Niue', 'Nauru', 'Monaco']
# removing the countries listed in dx from the dataset
dataset = dataset[~dataset.country.isin(dx)]
# confirming that the countries in dx are gone:
# ordering from least to greatest and checking the first 5 values
dataset.value_counts('country').sort_values().reset_index().head(5)
| | country | 0 |
|---|---|---|
| 0 | Afghanistan | 16 |
| 1 | Botswana | 16 |
| 2 | Algeria | 16 |
| 3 | Angola | 16 |
| 4 | Antigua and Barbuda | 16 |
# Filling NAs in each column, grouped by country, with a forward fill followed by a backward fill.
fill_columns = ['alcohol', 'hepatitis_b', 'bmi', 'polio', 'total_expenditure', 'diphtheria',
                'gdp', 'population', 'income_composition_of_resources', 'schooling',
                'thinness__1_19_years', 'thinness_5_9_years']
for col in fill_columns:
    dataset[col] = dataset.groupby('country')[col].transform(lambda x: x.ffill().bfill())
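A group-wise forward fill followed by a backward fill can only borrow values from the same country, so a country with no observations at all in a column stays empty. The sketch below illustrates this on made-up data; the frame `toy` and its values are assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): country "B" has no GDP observations at all
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "gdp": [np.nan, 10.0, np.nan, np.nan, np.nan],
})

# Forward fill, then backward fill, within each country
toy["gdp_filled"] = toy.groupby("country")["gdp"].transform(lambda s: s.ffill().bfill())

# Country "A" is completed from its single observation; country "B" stays missing
remaining_nulls = int(toy["gdp_filled"].isna().sum())
print(toy)
print(remaining_nulls)
```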
# checking null value counts again
msno.bar(dataset)
<AxesSubplot:>
# testing my thought - checking for countries with all null values in the gdp column, which still has NAs
#grouping by country and counting the null values
gdp_null = dataset.gdp.isnull().groupby(dataset['country']).sum().reset_index()
#locating countries with more than 1 null value for gdp
gdp_null = gdp_null.loc[gdp_null['gdp'] > 1]
gdp_null.count()
country    25
gdp        25
dtype: int64
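The filter above keeps countries with more than one missing `gdp` value, which is not quite the same as "every value is missing". A minimal sketch of a stricter check, assuming the same `country` and `gdp` column names and using made-up data, compares the null mask within each group using `all()`:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "B" is fully missing, "C" is only partly missing
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "gdp": [1.0, 2.0, np.nan, np.nan, np.nan, 5.0],
})

# True only for countries where every gdp value is missing
all_null = toy["gdp"].isna().groupby(toy["country"]).all()
countries_without_gdp = all_null[all_null].index.tolist()
print(countries_without_gdp)
```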
# checking for duplicated rows (there are none)
dataset.duplicated().value_counts()
False    2928
dtype: int64
# Viewing distribution of all data
dataset.hist(alpha = .5, bins = 60, figsize = (20,50), layout=(10,2))
plt.tight_layout()
plt.show()
# counting the number of rows for developing and developed countries in the dataset
plt.figure(figsize=(12,8))
sns.countplot(data=dataset, x= 'status', order=dataset["status"].value_counts().index, palette= "husl")
<AxesSubplot:xlabel='status', ylabel='count'>
# assessing the differences between developed and developing countries
dataset.groupby('status').mean(numeric_only=True).reset_index()
| | status | year | life_expectancy | adult_mortality | infant_deaths | alcohol | percentage_expenditure | hepatitis_b | measles | bmi | ... | polio | total_expenditure | diphtheria | hiv/aids | gdp | population | thinness__1_19_years | thinness_5_9_years | income_composition_of_resources | schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Developed | 2007.5 | 79.197852 | 79.685547 | 1.494141 | 9.719336 | 2703.600380 | 83.302083 | 499.005859 | 51.803906 | ... | 93.736328 | 7.582148 | 93.476562 | 0.100000 | 22053.386446 | 6.830053e+06 | 1.320703 | 1.29668 | 0.852489 | 15.845474 |
| 1 | Developing | 2007.5 | 67.111465 | 182.833195 | 36.534768 | 3.422600 | 324.262018 | 74.617500 | 2836.618791 | 35.321351 | ... | 79.882450 | 5.577131 | 79.654801 | 2.096896 | 4217.160380 | 1.405700e+07 | 5.608725 | 5.65130 | 0.582092 | 11.225130 |
2 rows × 21 columns
# generating the numeric correlation table
dataset.corr(numeric_only=True)
| | year | life_expectancy | adult_mortality | infant_deaths | alcohol | percentage_expenditure | hepatitis_b | measles | bmi | under_five_deaths | polio | total_expenditure | diphtheria | hiv/aids | gdp | population | thinness__1_19_years | thinness_5_9_years | income_composition_of_resources | schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| year | 1.000000 | 0.170033 | -0.079052 | -0.036464 | -0.077391 | 0.032723 | 0.243332 | -0.081840 | 0.104668 | -0.041980 | 0.103324 | 0.089489 | 0.142635 | -0.138789 | 0.101017 | 0.016712 | -0.045082 | -0.048152 | 0.242953 | 0.213265 |
| life_expectancy | 0.170033 | 1.000000 | -0.696359 | -0.196557 | 0.402874 | 0.381864 | 0.319101 | -0.157586 | 0.567694 | -0.222529 | 0.460142 | 0.230900 | 0.474818 | -0.556556 | 0.461511 | -0.021371 | -0.477183 | -0.471584 | 0.724776 | 0.751975 |
| adult_mortality | -0.079052 | -0.696359 | 1.000000 | 0.078756 | -0.197702 | -0.242860 | -0.180219 | 0.031176 | -0.387017 | 0.094146 | -0.273500 | -0.127032 | -0.274592 | 0.523821 | -0.297521 | -0.013897 | 0.302904 | 0.308457 | -0.457626 | -0.454612 |
| infant_deaths | -0.036464 | -0.196557 | 0.078756 | 1.000000 | -0.114743 | -0.085906 | -0.219671 | 0.501038 | -0.227480 | 0.996628 | -0.166995 | -0.128293 | -0.171621 | 0.024955 | -0.108046 | 0.556815 | 0.465700 | 0.471340 | -0.145018 | -0.195202 |
| alcohol | -0.077391 | 0.402874 | -0.197702 | -0.114743 | 1.000000 | 0.335397 | 0.092529 | -0.050301 | 0.336358 | -0.111825 | 0.226894 | 0.301697 | 0.224955 | -0.047298 | 0.354459 | -0.034409 | -0.427377 | -0.416304 | 0.446448 | 0.540703 |
| percentage_expenditure | 0.032723 | 0.381864 | -0.242860 | -0.085906 | 0.335397 | 1.000000 | -0.001324 | -0.056831 | 0.231130 | -0.088152 | 0.148875 | 0.167788 | 0.145417 | -0.098230 | 0.899650 | -0.025576 | -0.252397 | -0.253931 | 0.382244 | 0.391466 |
| hepatitis_b | 0.243332 | 0.319101 | -0.180219 | -0.219671 | 0.092529 | -0.001324 | 1.000000 | -0.154028 | 0.216539 | -0.230773 | 0.487581 | 0.114038 | 0.588850 | -0.126628 | 0.068584 | -0.087231 | -0.165324 | -0.177734 | 0.276544 | 0.298144 |
| measles | -0.081840 | -0.157586 | 0.031176 | 0.501038 | -0.050301 | -0.056831 | -0.154028 | 1.000000 | -0.176069 | 0.507718 | -0.132613 | -0.103498 | -0.138292 | 0.030673 | -0.076057 | 0.265990 | 0.224579 | 0.220836 | -0.129465 | -0.138344 |
| bmi | 0.104668 | 0.567694 | -0.387017 | -0.227480 | 0.336358 | 0.231130 | 0.216539 | -0.176069 | 1.000000 | -0.237910 | 0.280603 | 0.239904 | 0.278867 | -0.243735 | 0.303799 | -0.071633 | -0.530805 | -0.537784 | 0.509299 | 0.558363 |
| under_five_deaths | -0.041980 | -0.222529 | 0.094146 | 0.996628 | -0.111825 | -0.088152 | -0.230773 | 0.507718 | -0.237910 | 1.000000 | -0.184896 | -0.130034 | -0.191998 | 0.037783 | -0.111872 | 0.544437 | 0.467771 | 0.472244 | -0.163185 | -0.210945 |
| polio | 0.103324 | 0.460142 | -0.273500 | -0.166995 | 0.226894 | 0.148875 | 0.487581 | -0.132613 | 0.280603 | -0.184896 | 1.000000 | 0.148531 | 0.679621 | -0.156445 | 0.217112 | -0.036116 | -0.218594 | -0.219433 | 0.389327 | 0.424306 |
| total_expenditure | 0.089489 | 0.230900 | -0.127032 | -0.128293 | 0.301697 | 0.167788 | 0.114038 | -0.103498 | 0.239904 | -0.130034 | 0.148531 | 1.000000 | 0.157033 | -0.005506 | 0.139612 | -0.077157 | -0.277092 | -0.284654 | 0.183496 | 0.277368 |
| diphtheria | 0.142635 | 0.474818 | -0.274592 | -0.171621 | 0.224955 | 0.145417 | 0.588850 | -0.138292 | 0.278867 | -0.191998 | 0.679621 | 0.157033 | 1.000000 | -0.162214 | 0.206914 | -0.026114 | -0.225675 | -0.219046 | 0.410487 | 0.433396 |
| hiv/aids | -0.138789 | -0.556556 | 0.523821 | 0.024955 | -0.047298 | -0.098230 | -0.126628 | 0.030673 | -0.243735 | 0.037783 | -0.156445 | -0.005506 | -0.162214 | 1.000000 | -0.135058 | -0.027818 | 0.203550 | 0.206772 | -0.249380 | -0.222214 |
| gdp | 0.101017 | 0.461511 | -0.297521 | -0.108046 | 0.354459 | 0.899650 | 0.068584 | -0.076057 | 0.303799 | -0.111872 | 0.217112 | 0.139612 | 0.206914 | -0.135058 | 1.000000 | -0.027784 | -0.288711 | -0.293229 | 0.457725 | 0.445368 |
| population | 0.016712 | -0.021371 | -0.013897 | 0.556815 | -0.034409 | -0.025576 | -0.087231 | 0.265990 | -0.071633 | 0.544437 | -0.036116 | -0.077157 | -0.026114 | -0.027818 | -0.027784 | 1.000000 | 0.253449 | 0.250954 | -0.008319 | -0.031193 |
| thinness__1_19_years | -0.045082 | -0.477183 | 0.302904 | 0.465700 | -0.427377 | -0.252397 | -0.165324 | 0.224579 | -0.530805 | 0.467771 | -0.218594 | -0.277092 | -0.225675 | 0.203550 | -0.288711 | 0.253449 | 1.000000 | 0.938953 | -0.422210 | -0.477434 |
| thinness_5_9_years | -0.048152 | -0.471584 | 0.308457 | 0.471340 | -0.416304 | -0.253931 | -0.177734 | 0.220836 | -0.537784 | 0.472244 | -0.219433 | -0.284654 | -0.219046 | 0.206772 | -0.293229 | 0.250954 | 0.938953 | 1.000000 | -0.410825 | -0.466334 |
| income_composition_of_resources | 0.242953 | 0.724776 | -0.457626 | -0.145018 | 0.446448 | 0.382244 | 0.276544 | -0.129465 | 0.509299 | -0.163185 | 0.389327 | 0.183496 | 0.410487 | -0.249380 | 0.457725 | -0.008319 | -0.422210 | -0.410825 | 1.000000 | 0.800046 |
| schooling | 0.213265 | 0.751975 | -0.454612 | -0.195202 | 0.540703 | 0.391466 | 0.298144 | -0.138344 | 0.558363 | -0.210945 | 0.424306 | 0.277368 | 0.433396 | -0.222214 | 0.445368 | -0.031193 | -0.477434 | -0.466334 | 0.800046 | 1.000000 |
# plotting the absolute correlation (in percent) of each numeric column with life_expectancy as a bar chart
pd.DataFrame(abs(dataset.corr(numeric_only=True)['life_expectancy'].\
drop('life_expectancy')*100).sort_values(ascending=False)).plot.bar(figsize = (12,8))
plt.yticks(size = 15)
plt.xticks(size = 12)
plt.show()
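Taking the absolute value before plotting hides whether a predictor moves with or against life expectancy. One possible variation, sketched here on synthetic data (the column names `target`, `pos`, `neg`, and `noise` are assumptions), orders predictors by strength while keeping the signed correlations for display.

```python
import numpy as np
import pandas as pd

# Synthetic data: one positively related, one negatively related, one unrelated column
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "target": x,
    "pos": x + rng.normal(scale=0.5, size=200),
    "neg": -x + rng.normal(scale=1.0, size=200),
    "noise": rng.normal(size=200),
})

corr = toy.corr()["target"].drop("target")
# Order by strength (absolute value) but keep the signed values
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Calling `ranked.plot.bar()` on the result would then draw negative correlations below the axis.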
# generating the correlation heatmap
plt.figure(figsize=(20,12))
sns.heatmap(dataset.corr(numeric_only=True),annot=True)
<AxesSubplot:>
sns.pairplot(dataset)
<seaborn.axisgrid.PairGrid at 0x7ff228dae6d0>
%matplotlib widget
# adding the unique country names to a list
country = np.unique(dataset.country.tolist()).tolist()
# extracting the color palette into a list; the palette can be changed
pal = list(sns.color_palette(palette=('Set2'), n_colors=len(country)).as_hex())
fig = go.Figure()
# looping through each country in the dataset and plotting each data point for each year
for d,p in zip(country, pal):
    fig.add_trace(go.Scatter(x = dataset[dataset['country']==d]['year'],
                             y = dataset[dataset['country']==d]['life_expectancy'],
                             #visible = 'legendonly',
                             name = d,
                             line_color = p,
                             fill=None)) #tozeroy
# The code below generates an interactive plot. You can double-click the legend to display
# all data or remove all data; click on a single (or multiple) country name in the legend to display its data.
fig.show()
# modeling life_expectancy and year below via linear regression
year_model = smf.ols(formula = 'life_expectancy ~ year',
data = dataset).fit()
print(year_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: life_expectancy R-squared: 0.029
Model: OLS Adj. R-squared: 0.029
Method: Least Squares F-statistic: 87.11
Date: Thu, 29 Dec 2022 Prob (F-statistic): 1.96e-20
Time: 22:29:51 Log-Likelihood: -10710.
No. Observations: 2928 AIC: 2.142e+04
Df Residuals: 2926 BIC: 2.144e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -635.8715 75.546 -8.417 0.000 -783.999 -487.744
year 0.3512 0.038 9.333 0.000 0.277 0.425
==============================================================================
Omnibus: 178.257 Durbin-Watson: 0.151
Prob(Omnibus): 0.000 Jarque-Bera (JB): 190.835
Skew: -0.593 Prob(JB): 3.64e-42
Kurtosis: 2.606 Cond. No. 8.74e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.74e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
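# p-value of the intercept (first entry of pvalues); the slope's p-value is year_model.pvalues['year']
year_model.pvalues.iloc[0]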
5.957756451252083e-17
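To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below plugs the rounded values printed in the summary into `intercept + slope * year`. The function name is an assumption, as is the 2000 to 2015 range, which is inferred from the 16 rows per country and the mean year of 2007.5.

```python
# Coefficients copied from the summary table above (rounded as printed)
intercept = -635.8715
slope_per_year = 0.3512

def predicted_life_expectancy(year: int) -> float:
    """Fitted value of the simple linear model for a given calendar year."""
    return intercept + slope_per_year * year

pred_2000 = predicted_life_expectancy(2000)
pred_2015 = predicted_life_expectancy(2015)
# Difference between the two fitted values, i.e. 15 years times the slope
gain = pred_2015 - pred_2000
print(round(pred_2000, 2), round(pred_2015, 2), round(gain, 2))
```

The large negative intercept is the fitted value at year 0 and has no practical meaning. With an R-squared of 0.029, the year alone explains very little of the variation between countries.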
px.scatter(dataset, x='life_expectancy',y='schooling',color='country',
size = 'year',title='Life Expectancy and Schooling')
# modeling life_expectancy and schooling below via linear regression
schooling_model = smf.ols(formula = 'life_expectancy ~ schooling',
data = dataset).fit()
print(schooling_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: life_expectancy R-squared: 0.565
Model: OLS Adj. R-squared: 0.565
Method: Least Squares F-statistic: 3599.
Date: Thu, 29 Dec 2022 Prob (F-statistic): 0.00
Time: 22:30:17 Log-Likelihood: -8964.3
No. Observations: 2768 AIC: 1.793e+04
Df Residuals: 2766 BIC: 1.794e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 44.1089 0.437 100.992 0.000 43.252 44.965
schooling 2.1035 0.035 59.995 0.000 2.035 2.172
==============================================================================
Omnibus: 283.391 Durbin-Watson: 0.267
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1122.013
Skew: -0.445 Prob(JB): 2.28e-244
Kurtosis: 5.989 Cond. No. 46.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
#creating new dataset without the NAs in schooling.
prediction_data = dataset[['life_expectancy', 'schooling']].dropna()
# Test train split for supervised learning
X_train, X_test, y_train, y_test = train_test_split(prediction_data.life_expectancy, prediction_data.schooling)
# Test Train split visualization
plt.figure(figsize=(9,6))
plt.scatter(X_train, y_train, label = 'Training Data',color ='r', alpha =.7)
plt.scatter(X_test, y_test, label = 'Testing Data', color ='g', alpha =.7)
plt.legend()
plt.title('Life expectancy and Schooling')
plt.xlabel('Life Expectancy')
plt.ylabel('Schooling')
plt.show()
# create linear model and train it
LR = LinearRegression()
LR.fit(X_train.values.reshape(-1,1), y_train.values)
LinearRegression()
# Use model to predict on test data
prediction = LR.predict(X_test.values.reshape(-1,1))
# plot prediction line against actual test data
plt.figure(figsize=(9,6))
plt.plot(X_test, prediction, label = 'Linear Regression', color = 'b')
plt.scatter(X_test, y_test, label = 'Actual Test Data',color= 'g',alpha= .7)
plt.title('Life expectancy and Schooling')
plt.xlabel('Average Life Expectancy by Country')
plt.ylabel('Schooling')
plt.legend()
plt.show()
# putting a life expectancy of 55 into the prediction. The output is the school level (in years of schooling)
print('A country with a life expectancy of 55 is predicted to experience about',
round(LR.predict([[55]])[0],2), 'years of schooling')
A country with a life expectancy of 55 is predicted to experience about 8.17 years of schooling
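The test split above is only used for plotting. A natural follow-up is to score the predictions on the held-out data. The sketch below shows one way to do that with `r2_score` and `mean_absolute_error` from scikit-learn. It runs on synthetic stand-in data, so the coefficients and noise level are assumptions and the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): schooling rises roughly linearly with life expectancy
rng = np.random.default_rng(42)
life_expectancy = rng.uniform(45, 85, size=500)
schooling = 0.27 * life_expectancy - 6.5 + rng.normal(scale=2.0, size=500)

# Fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    life_expectancy.reshape(-1, 1), schooling, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Scores computed only on data the model has not seen
r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
print(f"R^2 on test data: {r2:.3f}, MAE: {mae:.2f} years")
```

In the notebook itself, the same two metric calls could be applied to `y_test` and `prediction`.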
px.scatter(dataset, x='life_expectancy',y='adult_mortality',color='country',
size = 'year',title='Life Expectancy and Adult Mortality')
# modeling life_expectancy and adult_mortality below via linear regression
adult_model = smf.ols(formula = 'life_expectancy ~ adult_mortality',
data = dataset).fit()
print(adult_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: life_expectancy R-squared: 0.485
Model: OLS Adj. R-squared: 0.485
Method: Least Squares F-statistic: 2755.
Date: Thu, 29 Dec 2022 Prob (F-statistic): 0.00
Time: 22:30:19 Log-Likelihood: -9782.0
No. Observations: 2928 AIC: 1.957e+04
Df Residuals: 2926 BIC: 1.958e+04
Df Model: 1
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 78.0182 0.210 371.804 0.000 77.607 78.430
adult_mortality -0.0534 0.001 -52.485 0.000 -0.055 -0.051
==============================================================================
Omnibus: 1021.341 Durbin-Watson: 0.762
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3874.676
Skew: -1.703 Prob(JB): 0.00
Kurtosis: 7.490 Cond. No. 343.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
px.scatter(dataset, x='life_expectancy',y='infant_deaths',color='country',
size = 'year',title='Life Expectancy and Infant Deaths')
# modeling life_expectancy and infant_deaths below via linear regression
infant_deaths_model = smf.ols(formula = 'life_expectancy ~ infant_deaths',
data = dataset).fit()
print(infant_deaths_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: life_expectancy R-squared: 0.039
Model: OLS Adj. R-squared: 0.038
Method: Least Squares F-statistic: 117.6
Date: Thu, 29 Dec 2022 Prob (F-statistic): 6.88e-27
Time: 22:30:21 Log-Likelihood: -10696.
No. Observations: 2928 AIC: 2.140e+04
Df Residuals: 2926 BIC: 2.141e+04
Df Model: 1
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 69.7069 0.178 391.102 0.000 69.357 70.056
infant_deaths -0.0158 0.001 -10.844 0.000 -0.019 -0.013
==============================================================================
Omnibus: 166.073 Durbin-Watson: 0.162
Prob(Omnibus): 0.000 Jarque-Bera (JB): 193.261
Skew: -0.624 Prob(JB): 1.08e-42
Kurtosis: 2.837 Cond. No. 126.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# modeling life_expectancy and population below via linear regression
pop_model = smf.ols(formula = 'life_expectancy ~ population',
data = dataset).fit()
print(pop_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: life_expectancy R-squared: 0.000
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 1.045
Date: Thu, 29 Dec 2022 Prob (F-statistic): 0.307
Time: 22:30:22 Log-Likelihood: -8472.5
No. Observations: 2288 AIC: 1.695e+04
Df Residuals: 2286 BIC: 1.696e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 68.7216 0.210 327.628 0.000 68.310 69.133
population -3.442e-09 3.37e-09 -1.022 0.307 -1e-08 3.16e-09
==============================================================================
Omnibus: 115.164 Durbin-Watson: 0.171
Prob(Omnibus): 0.000 Jarque-Bera (JB): 120.077
Skew: -0.529 Prob(JB): 8.43e-27
Kurtosis: 2.625 Cond. No. 6.36e+07
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.36e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
px.scatter(dataset, x='life_expectancy',y='income_composition_of_resources',color='country',
size = 'year',title='Life Expectancy and Income')
# modeling life_expectancy and income below via linear regression
income_model = smf.ols(formula = 'life_expectancy ~ income_composition_of_resources',
data = dataset).fit()
print(income_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: life_expectancy R-squared: 0.525
Model: OLS Adj. R-squared: 0.525
Method: Least Squares F-statistic: 3061.
Date: Thu, 29 Dec 2022 Prob (F-statistic): 0.00
Time: 22:30:23 Log-Likelihood: -9086.7
No. Observations: 2768 AIC: 1.818e+04
Df Residuals: 2766 BIC: 1.819e+04
Df Model: 1
Covariance Type: nonrobust
===================================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------------
Intercept 49.1735 0.385 127.809 0.000 48.419 49.928
income_composition_of_resources 32.1572 0.581 55.325 0.000 31.018 33.297
==============================================================================
Omnibus: 303.292 Durbin-Watson: 0.338
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1554.927
Skew: 0.395 Prob(JB): 0.00
Kurtosis: 6.586 Cond. No. 6.67
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
px.scatter(dataset, x='life_expectancy',y='gdp',color='country',
size = 'year',title='Life Expectancy and GDP')
# modeling life_expectancy and gdp below via linear regression
gdp_model = smf.ols(formula = 'life_expectancy ~ gdp',
data = dataset).fit()
print(gdp_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: life_expectancy R-squared: 0.213
Model: OLS Adj. R-squared: 0.213
Method: Least Squares F-statistic: 683.6
Date: Thu, 29 Dec 2022 Prob (F-statistic): 1.44e-133
Time: 22:30:25 Log-Likelihood: -9026.2
No. Observations: 2528 AIC: 1.806e+04
Df Residuals: 2526 BIC: 1.807e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 66.8970 0.193 346.913 0.000 66.519 67.275
gdp 0.0003 1.21e-05 26.146 0.000 0.000 0.000
==============================================================================
Omnibus: 156.820 Durbin-Watson: 0.378
Prob(Omnibus): 0.000 Jarque-Bera (JB): 186.682
Skew: -0.666 Prob(JB): 2.90e-41
Kurtosis: 2.986 Cond. No. 1.80e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
px.scatter(dataset, x='life_expectancy',y='alcohol',color='country',
size = 'year',title='Life Expectancy and Alcohol')
# modeling life_expectancy and alcohol below via linear regression
alcohol_model = smf.ols(formula = 'life_expectancy ~ alcohol',
data = dataset).fit()
print(alcohol_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: life_expectancy R-squared: 0.162
Model: OLS Adj. R-squared: 0.162
Method: Least Squares F-statistic: 563.8
Date: Thu, 29 Dec 2022 Prob (F-statistic): 4.49e-114
Time: 22:30:27 Log-Likelihood: -10423.
No. Observations: 2912 AIC: 2.085e+04
Df Residuals: 2910 BIC: 2.086e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 65.0583 0.241 270.352 0.000 64.587 65.530
alcohol 0.9385 0.040 23.745 0.000 0.861 1.016
==============================================================================
Omnibus: 270.852 Durbin-Watson: 0.199
Prob(Omnibus): 0.000 Jarque-Bera (JB): 348.774
Skew: -0.831 Prob(JB): 1.84e-76
Kurtosis: 3.335 Cond. No. 9.25
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
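For a regression with a single predictor, R-squared equals the square of the Pearson correlation between the predictor and the response. That gives a quick consistency check between the correlation table and the OLS summaries. The sketch below uses only numbers copied from the outputs above; the list name `reported` is an assumption.

```python
# (predictor, correlation with life_expectancy from the table, R-squared from the OLS summary)
reported = [
    ("year", 0.170033, 0.029),
    ("schooling", 0.751975, 0.565),
    ("adult_mortality", -0.696359, 0.485),
    ("infant_deaths", -0.196557, 0.039),
    ("population", -0.021371, 0.000),
    ("income_composition_of_resources", 0.724776, 0.525),
    ("gdp", 0.461511, 0.213),
    ("alcohol", 0.402874, 0.162),
]

# Collect any predictor whose squared correlation disagrees with the reported R-squared
mismatches = [name for name, r, r2 in reported if round(r * r, 3) != r2]
for name, r, r2 in reported:
    print(f"{name:32s} r^2 = {r * r:.3f}  reported R^2 = {r2:.3f}")
print(mismatches)
```

The sign of each correlation also answers the direction questions posed at the start: schooling, income, GDP, and alcohol are positively associated with life expectancy, while adult mortality and infant deaths are negatively associated. Population shows essentially no linear association. These are associations only, so the positive alcohol coefficient should not be read as a causal effect.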